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Sparse regression and marginal testing using 
cluster prototypes 

Stephen Reid* and Robert Tibshirang 


Abstract 


We propose a new approach for sparse regression and marginal testing, 
for data with correlated features. Our procedure first clusters the features, 
and then chooses as the cluster prototype the most informative feature in 
that cluster. Then we apply either sparse regression (lasso) or marginal 
significance testing to these prototypes. While this kind of strategy is 
not entirely new, a key feature of our proposal is its use of the post¬ 


selection inference theory of Taylor et al. (20141 and Lee et al. (20141 to 


compute exact p-values and confidence intervals that properly account for 
the selection of prototypes. We also apply the recent “knockoff” idea of 


Barber & Candes (20141 to provide exact finite sample control of the FDR 


of our regression procedure. We illustrate our proposals on both real and 
simulated data. 


1 Introduction 


We consider the linear model setup: 

y = Xj3 + e 


( 1 ) 


where y is the nxl response vector, X an n x p matrix of predictors, /3 a p x 1 
vector of true regression coefficients and e an nxl vector of errors - usually 
assumed to have n-dimensional Gaussian distribution N(0,a 2 I). Although not 
critical for the ideas of this paper, we are especially interested in the p > n case. 
These types of predictor matrices are often encountered in practice and provide 
interesting case studies for our methods. Since our matrix is not of full column 
rank, least squares regression cannot be used for further analysis of the problem. 
Instead we proceed with a regression method that comprises of both variable 
selection and parameter estimation components. Examples include the lasso of 


Tibshirani (1996), the elastic net of Zou & Hastie (2005), principal component 


regression and forward and backward stepwise regression. We focus primarily 
on the lasso. 
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Furthermore, suppose that the columns of X share substantial empirical 
correlation - artifacts of a hypothetical underlying covariance matrix for the 
columns that exhibits significant block structure. Such correlation amongst 
the columns of the predictor matrix poses a thorny set of problems, including 
numerical instability of regression solutions, parameter near-nonidentihability 
and the inability of the lasso to recover the true set of signal variables. The lasso 
essentially selects randomly among a set of highly correlated signal variables, 
entering only one of them. Another, largely unaddressed, issue is that of a lack 
of interpretability. The lasso merely provides a set of predictive variables. It is 
silent on the grouping (correlation) structure amongst the variables or indeed 
the estimated effect on the response of those unselected variables. Of course, 
this tool is not designed for such a purpose. 

Our goal in this paper is to provide the owner of a wide dataset, exhibit¬ 
ing significant correlation amongst its columns, with a procedure to discover 
the column groupings and a single representative prototype from each group. 
Subsequent sparse regression or marginal testing is then performed on these 
prototypes. Critically, our method (described in subsequent sections) allows us 
to leverage the powerful results of Lee et al. ( |2014 ) and Taylor et al.| ( 2014),and 
subsequent work by |Lee fe Taylor ( 2Q14[ > to provide valid confidence intervals 
and p-values (testing for nullity) of the effect sizes of these prototypes. This is 
true even after their selection as prototypes and their participation in subse¬ 
quent sparse regression. As an additional feature, we can use the group structure 
learned in the first step and generate similar confidence intervals and p-values 
for those variables not selected, but correlated with, the prototypes. Note that 
this provides a highly interpretable snapshot of the data: a grouping structure, 
effect size estimates of most predictive prototypes of the grouping structure as 
well as for the also-rans in each of these groups. 

Our proposal comprises of a series of distinct steps: 


1. Grouping the columns of the predictor matrix X (assumed fixed), using 
either pre-defmed groups from the problem context or a clustering method. 

2. Prototype extraction, one from each cluster, by screening on the marginal 
correlation each cluster member has with the response y. 

3. Subsequent regression analysis on the selected prototypes, here specif¬ 
ically the lasso, what we call the Protolasso, or marginal testing of the 
prototypes, what we call Prototest. We use the theory of post-selection 
inference to obtain exact p-values and confidence intervals that properly 
account for the selection at every stage. 


As an illustration of the interpretive richness of our method, consider Fig¬ 
ure [l] The caption describes how data was generated. A single pair of X and 
y was generated. To this data we fit the ordinary lasso, used cross-validation 
to select the optimal number of variables and used the post-selection inference 
tools of Lee et al. (2014) to generate confidence intervals for the selected vari¬ 
ables. We also subjected the data to our protolasso procedure. We describe 
this procedure in later sections. Some observations: 
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Figure 1: Confidence interval plots for selected variables from ordinary lasso 
(top panel) and protolasso (bottom panel). Predictor matrix X has n = 50 
and p = 100 columns, generated to occur in groups of size 10 (columns 1 to 10, 
11 to 20, 21 to 30, etc.). Predictors in each group share pairwise correlations of 
size p = 0.5 amongst themselves, but are uncorrelated with all other predictors. 
Crosses represent the true target of the confidence interval (the partial correla¬ 
tion of the appropriate predictor with the response, given the selected model). 
Cross colours represent the grouping detected by each method. The ordinary 
lasso does not detect groups, so all crosses have the same colour. Confidence 
intervals are represented by dashed vertical bars and horizontal endpoints. Red 
confidence bars represent those for the selected predictors/prototypes, grey con¬ 
fidence bands represent intervals derived by swapping out the group prototype for 
the variable of interest (described later). The ordinary lasso does not tell us any¬ 
thing about group structure, precluding the swapping of other group members for 
the prototype, hence all the confidence intervals are in red. If predictors/clusters 
prototypes were not selected, we do not construct a confidence interval for them 
- hence the empty spaces. Dashed blue vertical lines show the indices of vari¬ 
ables that were given non-zero signal in the original fi vector, with parameter 
value set to 2. 
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• Ordinary lasso selects seven variables and constructs confidence intervals 
for them. The three true signal variables (1, 11 and 21) are selected. 

• Confidence intervals for the ordinary lasso seem rather wide, especially 
when comparing to the lower panel. 

• The ordinary lasso does not detect any group structure in the predictors. 
All crosses in the top panel are of the same colour. 

• Protolasso detects the group structure in the columns (in fact, recovers 
it exactly). Notice the different colour crosses in the lower panel. 

• Protolasso selects four prototypes (variables 1, 11, 21 and 94) and con¬ 
structs confidence intervals for them (red confidence interval bands), cap¬ 
turing all three true signal variables. Notice that the confidence intervals 
seem much narrower. 

• The grouping structure captured by protolasso allows us to construct 
confidence intervals for the non-prototypes in the groups of the selected 
prototypes. These are in grey. One can then ascertain the effect of re¬ 
placing the prototype by the non-prototype of interest. Note that these 
confidence intervals are still valid and take the entire selection procedure 
into account. Clearly such an analysis is impossible when using ordinary 
lasso. 

In subsequent sections, we discuss some related work, the detection of the 
group structure (first phase) and the tools used in the construction of our post 
selection confidence intervals and p-values. We conclude with some simulations 
and applications to real datasets. 

This paper is organized as follows. In section [2] we review some related work 
in the literature. Section [3] describes the clustering of the features and prototype 
extraction. In Section[4]we review the theory of post-selection inference and give 
details and examples of our Protolasso proposal. The Prototest of Section [6] 
carries out marginal testing of the selected prototypes. We discuss FDR control 
for protolasso via knockoffs in Section [7] Appendix A gives details of the 
proposed gap statistic for choosing the number of clusters. We conclude with 
discussion in Section [9j 


2 Related work 


The lasso has become a widely used tool for linear regression with simultaneous 
variable selection. It can be applied to matrices with p > n. There has been 
much theoretical development of the method’s variable selection and screening 


properties, under a variety of assumptions. References include Meinhausen & 

Buhlmann 

(2006), Zhao & Yu 

2006), |Van de Geer (2007) 

Zhang & Huang 

(2008) 

, Van de Geer & Buhlmam 

i 2009j), Meinhausen & B. ( 

2009), Bickel et al. 

to 

o 

o 

CD 

and 

Sun & Zhang (2011). 
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These positive findings are heartening. However, it has been noted that 
the lasso performs poorly in the presence of predictors with high empirical cor¬ 
relation. In particular, it selects but a single variable into the model from a 
set of mutually highly correlated variables, even if the omitted ones are true 
signal variables. Other methods like the elastic net of Zou & Hastie (2005), 
OSCAR of Bondell & Reich (2008) and the clustered lasso of She (2010) have 
been developed to address this shortcoming. These offer improvements, but do 
not explicitly take the correlation amongst the variables into account. All of the 
methods consider different penalty terms in order to encourage more acceptable 
behavior in the face of highly correlated predictors. 

Another set of methods cluster the variables first and then fit a model. Some 
methods do these two steps sequentially (clustering first and then fitting a model 
to some cluster representatives); others simultaneously. Examples of the former 
include principal component regression, gene shaving of Hastie et al. (2000), 
tree harvesting of Hastie et al. ( |2001 ) and the canonical correlation clustering 
and subsequent sparse regression of Buhlmann et al. (2013). The method of 


Dettling & Buhlmann (2004) and OSCAR represent the latter. 


The above methods all attempt to find clusters of variables and combine 
variables within a cluster to support good predictive ability in the subsequent 
(or concomitant) model fit. Our method also proceeds sequentially by first 
clustering variables into highly correlated groups. A second step differentiates 
our method from the others: once the clusters have been found, we extract 
a single representative prototype from the cluster membership. We do not 
combine all members of the cluster by averaging or projecting onto principal 
component directions. The prototypes are chosen so as to preserve subsequent 
predictive ability as much as possible. 

Once we have constructed clusters and extracted prototypes, the user can 
proceed as they wish. We, however, follow Buhlmann et al. (2013) and perform 
a sparse regression after prototype selection. The novelty of our method (and 
the source of its rich interpretability) is our liberal use of the post selection 
inference framework of Taylor et al.| (2014) and Lee et al. (2014). Our clustering, 
prototyping and subsequent sparse regression procedures are all chosen so as 
to ensure adherence to their framework, enabling the computation of a set of 
confidence intervals and p-values meant to give deeper understanding of the 
grouping structure, the significant prototypes and their potential doppelgangers. 
We discuss the clustering algorithm next. 


3 Clustering and prototype extraction 

The first step in our procedure is the estimation the grouping structure of the 
features. In some problems these groupings may be pre-defined, for example, 
gene sets in microarray studies, organizing genes into functional units. In this 
case our procedure will make use of these groups. 

In many cases, however, no a priori grouping is available for the features and 
unsupervised clustering tools can be applied. Fortunately, the post selection re- 
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suits of and Lee et al. (2014) (the backbone of the ultimate interpretability of our 
method) all go through, conditional on the predictor matrix X. Any clustering 
performed on the columns of A' need not be taken into account when construct¬ 
ing these inferences later on, provided the clustering is truly unsupervised and 
receives no input from the response y. 

As such, we do not prescribe to the user any particular clustering method. 
We are free to use k- means, fc-medoids or any of the hierarchical clusterings: 


complete, single and minimax linkage (Bien & Tibshirani (2011)), for example. 


We proceed with hierarchical clustering methods. Minimax clustering has the 
added advantage of returning a clustering of the variables and a prototype for 
each cluster. These prototypes are determined solely on merit of A, with no 
input from y and so may lack some predictive power. They could be used 
immediately in further analysis, but we propose a different set of prototypes, 
that can be extracted from the output of any clustering method. They are 
described in a subsequent paragraph. 

Since we are concerned with empirical correlation, we prescribe that the 
dissimilarity matrix input to these clustering methods be d, a p x p matrix 
with (j, k) th entry: 


Djk = 1 - 


^2i=li x ij x j)( x ik x k) 


i.e. one minus the sample correlation of columns j and k where Xj = ^ E"=i x ij- 
Of course, each of the clustering algorithms require the user to set the number 
of required clusters as a parameter. Since our method forms somewhat of a 
pipeline of analyses, we do not want this step to be a bottleneck. We propose 
a method for the automatic selection of the number of clusters. Our proposal 


is a gap statistic procedure very similar to that of Tibshirani et al. (2001). 
To maintain the flow of the exposition, we defer details of this procedure to 
Appendix A. 


3.1 Prototype extraction 

Having identified clusters, the next step is to extract prototypes - one from each 
cluster. By “prototype” we mean a single cluster representative, chosen from 
among the members of the cluster. We exclude the possibility of using a point 
wise mean or first principal component of the points in the cluster. 

Some clustering methods produce such prototypes as a side effect of the al¬ 
gorithm. Examples of this are fc-medoid clustering and hierarchical clustering 
with minimax linkage, as developed by Bien & Tibshirani (2011). These meth¬ 
ods produce prototypes in an unsupervised fashion, with no recourse to the 
response y. Although undoubtedly a good feature of these methods, our goal is 
to extract prototypes more explicitly tied to their correlation to the response. 
These in some sense represent the “most interesting” members of the cluster 
when considering the response as well. 
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Once we make use of y in our prototype selection, and proceed with these 
prototypes in further analysis, any inference downstream must take this selec¬ 
tion effect into account. Luckily, we can leverage the post selection inference 


framework of Lee et al. (2014) to account for this selection effect, provided the 


prototype selection method is simple enough, i.e. linear in y , after conditioning 
on minimal additional information. One possibility would be to use the “least 
squares prototype”— that is, regress y on the features in the cluster and used 
the fitted values as the “prototype”. This would capture the joint signal in all 
of the cluster members, and would be particularly appropriate when modeling 
genesets in genomic studies. 

However in this paper we focus on simple marginal correlation screening. 
In particular, given a clustering with K clusters, written as {Ci, C 2 ,..., Ck} 
where U ^ =1 Ck = {1,2 ,p} and Cj D Ck = 4> for j ^ k, we extract prototypes 
P = {A, P' 2 , ..., Pk} where each A £ {1, 2,... ,p} and A £ Ck such that 


Pk = argrnax. 


jdiCk 


\Xj y I 


It turns out that the selection of prototypes in this way is easily translated 


into the framework of Lee et al. (2014) and hence easily incorporated into subse¬ 


quent inference. Even at this juncture, it is possible to make inferences (perform 
significance tests and construct confidence intervals) for these prototypes. Par¬ 
ticular examples include inferences on the individual marginal correlations with 
y and partial correlations if all (or some) of the prototypes are used in a subse¬ 
quent least squares regression. Details of post selection inference are discussed 
in the next section. 

Figure[2]demonstrates that the variable selection performance of the protolasso 
procedure (clustering and then marginal correlation screening in each cluster) 
is not much worse than that of two other selection procedures. The first of the 
other methods is a marginal screening method, each time selecting as many vari¬ 
ables as there are clusters in the protolasso method, retaining those with the 
largest absolute marginal correlation with y. The second is the lasso, applied 
to the entire dataset, with the regularization parameter chosen so as to ensure 
the same number of variables is selected. 

The lasso performs admirably, over all correlations - a function of the sheer 
size of the signal. The protolasso procedure is not far behind, selecting the 
non-prototypes in the signal clusters far less often than does the other two 
methods, while selecting some of the noise variables more often. This is a 
function of selecting a prototype from each cluster, many of which are filled only 
with noise variables. Selection performance degrades at very high correlations. 
However, at these correlations it is very difficult to distinguish between signal 
and non-signal variables in the same cluster anyway. 

The reader should note that our method is not designed to be competitive 
in the selection of signal variables. There are other methods more effective at 
this endeavor. The merit of our method is its rich interpretability. We show 
the output of this particular simulation merely to demonstrate that variable 
selection performance is not too critically compromised. 
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Figure 2: Dataset has n = 50 rows and p = 200 columns, divided into 20 groups 
of 10 columns each. Columns within a group share pairwise correlation p, which 
is set to p = 0.1, 0.2,..., 0.9 from the top left to the bottom right. Horizontal 
axes measure the variable number, while vertical axes measure the proportion of 
times out of S = 80 simulation runs in which each variable was selected by each 
of the methods. Black circles represent the protolasso procedure, red triangles 
the marginal screening procedure and the green plusses the ordinary lasso fit to 
all the variables. (3 has 3 non-zero entries - all set to 2 - at variables 1,11 and 
21. These are indicated by the three vertical grey dashed lines. 































4 Post selection inference 


Taylor et al. (2014) and Lee et al. (2014) present a powerful framework for 
inference after variable selection that takes the response into account. They 
consider the usual linear model, as in Equation ([I]), assuming in particular that 
JV(0,E). Let p = X/3. 

Their focus is inference on linear combinations rj 1 p after the application of 
a variable selection technique. Variable selection - a process usually privy to 
the response y - complicates post fit inference. Given that a set of variables 
M C {1,2,... ,p} has been selected (often with \M\ < n ), they argue correctly 
that the usual Gaussian least squares inference on (3^ = X^p, is invalid. Here 
the subscript denotes the restriction of the columns of X (or the entries of j3) to 
those in the set M. The superscript + denotes the Moore-Penrose of the matrix 
in question. 

Their main results pertain to the lasso. They show that, for fixed V, and 
conditioning on both the variables selected by the lasso M = M and the signs 
sm of the non-zero /3, we can construct matrices A = A l ^ s s ° and b = b l ^f 8 ° such 
that the selection event {M = M, s^ = Sm} can be written as a set of linear 
constraints on y: 

aIclsso ^ hiasso 


j M,s m 


Here SM,i = sign(/3M,j)- The authors give explicit formulae for A and b in their 
paper. The reader is referred there for details. 

This result can then be used to do inference on 77 T y post selection, by con¬ 
sidering the distribution of: 

y T y\Ay < b 

This allows us to characterize the conditional distribution exactly: 


F 


[V-,V+] / T 


where 


r] T fi,ij T Y]r) 


F [ ;*hx) = 


W y)\ A V < b~ Unif(0,1) 




[ 1,(7 


$ 




the truncated Gaussian cumulative distribution function (cdf). Again, the exact 
forms of V~ and V + are given in the paper. 

Note that the true utility of this result extends beyond a mere applica¬ 
tion to the lasso selection procedure. Indeed, any selection procedure that, 
with perhaps a little extra conditioning information, can be written as a set 
of linear constraints on y can be subjected to this framework. Other examples 
of such procedures are marginal correlation screening, as described in |Lee fe| 


Taylor (2014) and, in the multivariate mean vector estimation literature, the 


Benjamini-Hochberg procedure and the selection of the largest K elements of 
y, as described in Reid et al. (20141. 


For a given procedure, the challenge is to write the selection constraints it 
imposes as a set of linear constraints on y (if possible). This gives one the 
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forms of A and b specific to that method. Coupled with a particular contrast of 
interest y, which can be chosen dependent on the selected variable set M, one 
can then find particular values for V - and V + . With the addition of observed 
response data y and an assumption of known variance E, the distribution of 
the above is fully characterized, allowing p-values and confidence intervals to be 
computed via evaluations or inversions of the truncated Gaussian cdf. 

Our method of clustering, prototyping and subsequent sparse regression on 
prototypes is easily cast into this framework. In fact, the same framework can 
be used to do inference even on non-prototypes, leading to richer interpretation 
of the dataset. This is discussed next. 

4.1 Details of post selection inference for Protolasso 

Recall the steps of the protolasso and prototest procedures given in the 
Introduction, and summarized here for reference: 


The Protolasso and Prototest procedures 

1. Group the features: use either a pre-defined grouping of features, or cluster 
the features into groups. 

2. Form prototypes: extract a prototype feature from each cluster, by choos¬ 
ing the feature with highest marginal correlation with the response y ; 

3. Model fitting and inference: carry out a regression analysis on the se¬ 
lected prototypes (protolasso) or marginal testing of the prototypes 
(prototest). Use the theory of post-selection inference to compute exact 
p-values and confidence intervals that account for the selection effects. 


We assume the predictor matrix X fixed. This is also the assumption in the 
post selection inference framework discussed above. Note that if one assumes 
X fixed and if the clustering performed on the columns is completely unsuper¬ 
vised (including the method used to estimate the number of clusters), then the 
post selection distributional results remain unaffected. Identifying the grouping 
structure in the fixed X has no effect on the linear constraints on y. A new 
replication of y will not change the grouping structure so identified. The first 
step need not be accounted for in the post selection framework. 

The second step does involve y and thus contributes some linear inequalities 
to the post selection inference. Suppose we are give n a cluster C i, its prototype 
Pi and the sign = sign(a;T y ). According to 


Lee & Taylor 


(20141, the con¬ 


straints contributed by selecting the prototype in this cluster can be represented 
by the matrices 



— s 


(i) 


— s 


(i) 


1X l 

1X l 


( 2 ) 
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«- 


( 3 ) 


Collecting the constraints for all the clusters, the entire prototyping step con¬ 
tributes constraint matrices 


4” = 


4 ’ 1 

4 


A 


(K) 

~Pk 


(4) 


and 


(,<■> = 


4 \ 

4 


4’ / 


(5) 


where we condition on the set of selected prototypes and the signs of their 
marginal correlations with y , i.e. the event {P = P, 4 1 •* = s} where = 


(4 


(!) J 2 ) 


S {K) 

’A' 


)• 


Finally, the third step proceeds by fitting the lasso on the columns of X 


p- 


Following Lee et al. (2014), we can construct the constraint matrices for this 


regression for a fixed regularization parameter A, once we condition on the set 
of indices with non-zero coefficients M C P and the signs of these non-zero 
coefficients §( 2 \ On the conditioning event {P = P,M = M, = s^}j we 
have the constraint matrices 


/ 


A< 2 > = 


\Xp\ M {I — Qm ) 
~\ X p\m4 ~ Qm) 


\ -diag(s (2) ) (xJjXm) 


-1 


X 


M 


= 


1 X P\M 


(X T ) + S (2) 
1 + X4 *AX T ) + S (2) 


L P\M\ 

-A • diag(s (2) )(X^XM) _1 s (2) 


v -1 


xT m- 


where Q M = X M (XjjX M ) 

Overall then, if we condition on the event {P = P,M = M, s^ 1 ) = s ( 1 ) j §( 2 ) = 
s^ 2 l}, we have the post selection distribution 


v T y 1 


dv-,v+] 


( 6 ) 


where V and V + are gleaned from A = (AWJ, A.( 2 ) t ) T , b = fo^ 2 ^ T ) T , 

s^ 1 ), s 01 and 77 as described in Lee et al. (2014). Assuming that a 2 is known, 


we are free to do inference on rj y. 
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4.2 Inference for the selected prototypes 

Suppose that we have run the protolasso procedure and are interested in the 
inference of the partial correlation of selected prototype Pk with the response y 
i.e. the regression coefficient for that prototype obtained from the regression of 
y onto the selected prototypes. In this case, we choose y = where j is 

the index in M associated with Pk, giving us values for and V + . 

P-values for this coefficient r/ T y, (when testing for nullity) are gleaned by 
evaluating the cumulative distribution function described in Equation <©• Con¬ 
fidence intervals are obtained by inverting the same function, solving for ?? T /r, 
given data r] T y. Indeed, this is how we obtained the red confidence intervals in 
lower panel of Figure [l] 

4.3 Inference for other members of selected clusters 

We need not only limit ourselves to inference on the selected prototypes alone. 
The post selection inference framework allows us to say something about the 
other members of clusters of the selected prototypes. Suppose that the sec¬ 
ond stage lasso selected prototype (amongst others). The first stage of the 
protolasso procedure gives us an estimated group structure. It could be of 
interest to say something about the effect size of one of the other members of 
cluster Ck when we swap it for the selected prototype, but keep the prototypes 
from other selected clusters in the model. 

Formally, suppose we have selected prototypes with indices in M and that 
Pk G M. Furthermore, suppose that Ck has elements ji,j 2 ,---,j\c k \ and we 
want to exchange ji ^ Pk for the the prototype. We can perform inference in 
this case by merely selecting the correct form of r). In particular, let 

71 = ^MVIPjluFi})/ 

where again j is the index in M corresponding to Pk- Having defined an 7], 
we use the A and b matrices determined earlier to find values for V~ and V + 
and are then free to proceed with inference. This is how we obtained the gray 
(non-prototype) confidence intervals in the lower panel of Figure [l] 

To convince the reader that these procedures are indeed valid, consider the 
dataset described in Figure |Tj 

We generated S = 100 replications of this dataset, each time construct¬ 
ing two responses: one with zero signal /3 = 0 and one with non-zero signal 
/3i = /3ii = P 21 = 2. We used the gap statistic to select clusters and set the reg¬ 
ularization parameter in the second step lasso to ensure that three prototypes 
were selected. We then computed p-values, under both signal regimes, for the 
selected prototypes (left panel Figure [3j) and all other members of clusters with 
selected prototypes (right panel Figure [3]). Notice that the p-values behave as 
one would expect - they have uniform distribution for the zero signal (/3 = 0) 
case and a sub-uniform distribution when there is non-zero signal. 
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Selected prototypes 


Other members of selected clusters 





Figure 3: Dataset as in F%itre[7| Top row: Sorted p-values (testing for nullity) 
versus uniform order statistics for selected prototypes (left) and other members 
of selected clusters (right) when fj = 0. Notice the close concordance with the 
45 degree line through the origin, as suggested by the theory. Bottom row: 
Same, but for fl\ = flu = fl^i = 2. Notice the sub-uniform behavior of the 
p-values. 


Remark. We note that other supervised learning procedures like forward 
stepwise regression or LAR could be applied to the prototypes: the post¬ 
selection inference theory is similar, and would follow the development in Taylor 


et al. (2014) 


5 Simulation study 

In this section, we subject our method to a simulation study. We observe how its 
performance compares to that of the lasso and marginal screening (both without 
an initial clustering step). Performance is measured on datasets exhibiting some 
grouping structure. The correlation within groups is a variable we control to 
ascertain whether performance differs with changes in strength of the grouping 
structure. 


5.1 The setup 

At each setting of the simulation parameters, we generate a single matrix X 
once. It has n = 50 rows and p = 100 columns. We consider two different kinds 
of grouping structures (discussed presently). Intra group correlation is set to 
p = 0.1, 0.4 and 0.7 respectively for all pairs of variables within a group. Dif¬ 
ferent intra group correlations allow us to simulate varying degrees of grouping 
structure strength. Inter group correlation amongst variables is zero. 
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Once we have generated the predictor matrix (centered and scaled so that 
each column Xj is such that 11 11 2 = n), we generate S = 100 replications of 

the response y according to Equation ([!]), with cr = 1. 

We consider two types of grouping structures, each with their own configu¬ 
rations for fi: 

1. Block diagonal: The p = 100 variables are divided into 10 groups, each of 
size 10. Three fi configurations are considered: 

(a) Single signal per cluster: fii = /3n = fi 2 \ = fisi = fia = fisi = /3* ^ 
0, with all other fi entries zero. /3* is set at our discretion with its 
current value always explicitly stated in the relevant output. 

(b) Paired signals: fii = fi 2 = fin = fii 2 = fii\ = fill = fi *■ Note that 
this configuration might confuse the lasso for larger configurations. 

(c) Tight signal: fii = fi 2 = fis = fii = fib = fie = fi*. This could be 
even more confusing for the lasso for larger correlations. 

2. Single block: There is one large block of correlated variables numbered 
from 1 to 50. All other variables are uncorrelated (i.e. blocks of size 1). 
Two fi configurations are considered: 

(a) Tight signal: fii = fi 2 = fib = fii = fib = fie = fi*- 

(b) Split signal: fi 1 = fi 2 = fi 3 = fi 51 = fi 52 = fi 53 = fi *. 

In the interest of saving space, we do not present all output for all configurations. 
Often performance follows a similar theme for different configurations and only 
one needs to be presented in detail. 

5.2 Performance measures 

The utility of our method hinges on interpretability and post-selection inference. 
As such, we consider two broad metrics of success: variables entertained and 
confidence interval width. 

We have already alluded to how our method can be used to make inference 
about not only the selected prototypes, but also those variables within the same 
cluster, once they are switched in for the prototype. Even if not directly selected, 
such a variable can still be entertained for further analysis. 

Each fi configuration defines a set Sq of indices corresponding to non-zero 
entries. If our method is to be successful, we wish is to return an “entertained 
set” S that captures as much of Sq as possible. We define S as the set of 
prototypes selected in the second stage lasso fit and all the members of the 
clusters of these prototypes. The entertained sets for the other two methods 
are the set of indices corresponding to non-zero coefficient estimates (for the 
lasso) and the largest marginal correlations with y (for marginal screening) 
respectively. The absence of an estimated group structure precludes the addition 
of any further elements to the entertained sets of these latter two methods. 
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A method generates an entertained set S and its entertained proportion is 
defined to be 

|5o| 

For each simulation run, b , we obtain the entertained proportion for each of the 
three methods: EP pl , EPj; asso and EP™ S . Since interest focuses on the relative 
performance of our method, we consider the measures: 

1 S 

-^(EP pl - EP^ so ) 

6=1 

1 s 

- J2( EP b l - EPD 

° b—1 

These are plotted in Figures [4j [5] and |6j for example. Notice that these mea¬ 
sures vary between -1 and 1, with larger (more positive values) implying that 
the protolasso procedure entertains a greater proportion of true variables on 
average. 

We also measure the confidence interval width of the confidence intervals 
constructed for each of the methods. Each method produces a set M of selected 
variables (here only the selected prototypes for the protolasso procedure). 
The post selection inference framework allows us to construct intervals for the 
elements of (XT ) + /x. Each method produces its own A and b matrices. For 

each replication, we compute the confidence intervals for all the elements of M pl 
(only selected prototypes), M lasso and M ms , take the median interval width 
over these indices, giving MWj ’ 1 , MWl asso and MVF™ S , where “MW” is short 
for “median width”. Finally, we consider the measures 

median h MW^ asso 

medianbMW^ asso + median;,MW^ 

median b MW^ s 

medianbMW™ s + median^MW^ 

Note these measures vary between 0 and 1, with larger values implying that the 
protolasso method tends to produce narrower intervals (which is obviously 
better than the alternative). These are plotted in Figures [7] and [8j for example. 
We choose this specific form to ensure that the measure ranges within a limited, 
well defined range. Results are presented in a heat map representation. Colors 
are more easily mapped to limited range measures. 

Finally, we control the number of clusters extracted from the hierarchical 
clustering and adjust the regularization parameters in the subsequent sparse 
regression so as to fix the number of these clusters that are selected at the 
second stage. We control and vary these numbers, despite protolasso having an 
automatic mechanism for deciding on these numbers. The control and variation 
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of these parameters allow us to study the effect on our performance measures of 
allowing different numbers of variables into the analysis. At each replication, the 
preset number of clusters and selected clusters determine the entertained set of 
the protolasso procedure. We then ensure that the other two methods (lasso 
and marginal screening) have the same sized entertained sets , by adjusting the 
relevant regularization parameters. Methods can then be measured on equal 
footing. We considered numbers of clusters ranging between 8 and 12 and 
selected clusters ranging between 1 and 7. 

5.3 Entertained sets 



Number of clusters selected Number of clusters selected 


Figure 4: Heatmaps of entertained proportion performance measures as function 
of the initially extracted clusters (vertical axis) and those selected by the second 
stage lasso (horizontal axis). Dataset is the block diagonal X with the paired 
signal fd configuration. /3* = 0.2. Left panels: Relative to lasso on all 

variables. Right panels: Relative to marginal screening. Top row: Pairwise 
correlation of variables in groups p = 0.1; middle row: p = 0.4; bottom row: 
p = 0.7. Colour scale on the right of each plot. “More red” signifies that measure 
is closer to 1 (and protolasso outperforms the relevant method), while “more 
blue” signifies poorer performance relative to the other method. 

Figures |4] and [5] show the entertained proportion performance of protolasso 
against the other two methods for a reasonably small signal size /3* = 0.2. Some 
general observations: 

• Entertained proportion performance is fairly similar for the protolasso 
and marginal screening methods. We see this from the general green tinge 
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in the right column. This regularity seems to persist over different correla¬ 
tions, dataset types and signal configurations. Perhaps some improvement 
of protolasso as p increases. 

• Looking down the left panels, we see that protolasso performance rela¬ 
tive to the original lasso improves as p increases. Recall that Figure [4] is 
for the paired signal configuration. There are pairs of ever more correlated 
variables, both with signal. The lasso tends to select one or the other, but 
rarely both and one or the other may not make it into the entertained 
set. Protolasso, on the other hand, groups correlated variables together 
initially and tries to ensure that they do not both enter the second stage 
lasso. However, once we have selected any member of a cluster contain¬ 
ing these signal variables, both enter the entertained set. Protolasso 
then seems to counter the shortcoming of lasso by the nature in which it 
constructs its entertained set. 

• This effect is attenuated for the single signal case in Figure [5] Here we 
do not have correlated variables both getting signal. Each correlated clus¬ 
ter has at most one signal variable and the lasso is not so easily con¬ 
fused. Notice that protolasso and lasso perform similarly. Although not 
shown, the tight signal cluster accentuates the effect of the previous bullet: 
protolasso outperforms lasso to a greater extent, as all signal variables 
are correlated and in the same cluster. 

• As we increase the signal size to /3* = 2 in Figure|6j we see that all methods 
perform very similarly. At this point, all methods nearly always include 
all signal variables in their entertained sets. 

• Protolasso entertained proportion performance increases with the num¬ 
ber of selected clusters (along horizontal axis). Remember that the initial 
clustering imposes some restrictions on which variables make it into the 
entertained set. If we select only one cluster, but signals are distributed 
over more than one cluster, many signal variables do not make it into the 
entertained set. Since we allow the original lasso to select as many vari¬ 
ables as there are in the entertained set of the protolasso without regard 
to a clustering, it is free to select signal variables from other clusters not 
currently considered by the protolasso procedure. 

• The more clusters in the original clustering; the more potential proto¬ 
types. If we choose the number of clusters in excess of the true number 
of groups, we allow potentially correlated variables into the second stage 
lasso, leading to the same variable selection problems discussed before. 

• Results are qualitatively similar for the single block dataset. We omit 
these plots in the interest of saving space. 

Overall then, it would seem that protolasso includes the appropriate vari¬ 
ables in its entertained set. The slight relaxation implied by the definition of 
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Figure 5: Heatmaps of entertained proportion performance measures as function 
of the initially extracted clusters (vertical axis) and those selected by the second 
stage lasso (horizontal axis). Dataset is the single block X with the single signal 
(3 configuration. /?* = 0.2. Same setup as Figure^ 
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Figure 6: Heatmaps of entertained proportion performance measures as function 
of the initially extracted clusters (vertical axis) and those selected by the second 
stage lasso (horizontal axis). Dataset is the block diagonal X with the paired 
signal (3 configuration. (3* = 2. Same setup as Figure 
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the entertained set allows us to overcome the erratic variable selection issue 
encountered by the lasso amidst high correlations. 

5.4 Confidence interval width 

The post selection inference framework provides a simple way to make such in¬ 
ferences on a surprising variety of selection procedures. Confidence intervals are 
constructed for the selected signals and their widths depend, inter alia, on the 
nature of the selection procedure via the A and b matrices characteristic of the 
procedure. One can ask how interval widths compare for different procedures. 
Although the three procedures do not necessarily (rarely, in fact) compute con¬ 
fidence intervals for the same selected set - or even have the same target for 
the selected variable they do share - we would still prefer a procedure that pro¬ 
duces narrower intervals. The post selection framework provides guarantees on 
coverage, so we do not consider this here. We focus solely on relative interval 
widths, via the performance measure described in the previous subsection. 



Number of clusters selected Number of clusters selected 


Figure 7: Heatmap of relative confidence width performance measures as func¬ 
tion of the initially extracted clusters (vertical axis) and those selected by the 
second stage lasso (horizontal axis). Dataset is the block diagonal X with the 
paired signal ft configuration. /3* = 2. Same setup as Figure [7} 

Figures [7] and [8] show heat maps of the relative confidence interval width 
performance of the protolasso procedure against the other two methods. Fig¬ 
ure [7] uses a block diagonal matrix, while Figure [8] uses a single block matrix. 
The signal size is large (/?* = 2) so as to eliminate variable selection issues (all 
methods usually select all signals into their entertained sets) and focus only on 
the subsequent confidence interval widths. Some observations: 
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Figure 8: Heatmap of relative confidence width performance measures as func¬ 
tion of the initially extracted clusters (vertical axis) and those selected by the 
second stage lasso (horizontal axis). Dataset is the single block X with the tight 
signal (3 configuration. /3* = 2. Same setup as Figure [/| 


• The protolasso rarely constructs confidence intervals systematically much 
wider than the other two methods. Its worst performance occurs in the low 
correlation (p = 0.1) setting when we overselect the number of clusters. 
Protolasso seems to require a reasonably good estimate of the grouping 
structure for it to generate narrower intervals. 

• Protolasso’s relative performance improves markedly as p increases. At 
p = 0.7 (bottom rows, both figures), we see that intervals are in general 
shorter than both competitors, especially the lasso, in particular when we 
select roughly the correct number of signal clusters (vertical bands around 
3/4 selected clusters in Figure [7] and 1/2 in Figure|8|. 

• Note that Figure [8] has the tight distribution of the signal, all on variables 
correlated with each other. The lasso might select all the variables here, 
given the signal size. However, the high correlation amongst them (say 
when p = 0.7) causes some numerical instability in the constraint matri¬ 
ces A lasso and b lasso . This leads to wider intervals. Protolasso does not 
suffer from this. It selects only one of these variables (as they are all pre¬ 
sumably in the same cluster) and its constraint matrices are not affected 
by the large correlation. However, all signals are still in the entertained 
set and we can still compute confidence intervals for them. 

Overall then, protolasso performs admirably in the face of increased cor¬ 
relation, but does comparatively less well in low correlation settings where we 
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estimate too many clusters in the original clustering. A similar pattern to the 
entertained proportion performance then. 


6 Marginal testing: Prototest 

In the previous sections, we focussed on the regression problem, applying the 
lasso to the selected prototypes from clustering of the features, a procedure we 
called Protolasso. Here we briefly discuss the simpler problem of marginal 
testing. 

The first two steps in this method are the same as in Protolasso: we 
cluster the features, and choose a representative prototype from each cluster, 
being the one most correlated with the outcome. But instead of fitting the lasso 
to these prototypes, we carry out marginal hypothesis testing of each one. This 
inference is based on the selection-adjusted p-values using ©, © to account 
for they choice or prototype. We do not include ©, © , since the lasso is not 
used. We call the resulting method the prototest procedure. P-values for 
non-prototypes can be derived in much the same way as we did in protolasso. 

Figure [9] shows an example. We generated data from a standard Gaussian 
linear model: there were 10 groups of 6 features each, having within-group 
pairwise correlation of 0.7; 3 of the groups having a single signal feature. We 
orthogonalized the features of the other 7 group with respect to the 3 groups. 
The p-values over 200 simulations for the non-null and null groups are shown 
in the Figure. We see that the non-null p-values exhibit reasonable power while 
the null p-values are very close to uniform, as predicted by our theory. 

non-null null 




Figure 9: Prototest: Quantile-quantile plots of non-null and null p-values for 
simulated data example. 

Note that we have implicitly considered “non-null’ any cluster containing 
a signal feature, even if the prototype chosen for that cluster is not the signal 
feature. This seems reasonable, since the features in each cluster are highly 
correlated. 
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The basic definitions of a false positive call and the false discovery rate, 
become murky when the items (feature) under consideration are correlated. 
This very issue is the central topic, for example, of the paper by |G’Sell et al.[ 
(2014). The issue is as follows. In a regression setting for example, suppose that 
two predictors Xi and X 2 are very highly correlated but only X\ has a non-zero 
population coefficient. If we apply a section procedure and it chooses X 2l is this 
a false positive selection? With standard definitions of false positive and FDR, 
it would be. But intuitively, it seems like it should be considered a true positive. 


G’Sell et al. (2014) propose an alternate definition of false positive rate and FDR 


to address this issue, what they denote the UVR (uninformative variable rate). 
We will not discuss the details here. But we note that by considering correlated 
clusters as non-null or null, we skirt this issue. 

Using this idea, we computed the false positive rates for our simulated sample 
in the left panel of Figure [lOj We see that the false positive rates for the 
null clusters following their expectation closely. In the right panel we applied 
standard Benjamini-Hochberg FDR control to these p-values . We see that the 
FDR is indeed controlled at the nominal level. Since the p-values here are not 
exactly independent, the BH finite-sample control does not follow. However the 
asymptotic FDR control results of Storey et al. (2004) will likely be applicable. 




Figure 10: For the same simulation setup of Figure^ the left panel shows that 
the false positive rates for the null clusters follow their expectation closely. In 
the right panel the standard Benjamini-Hochberg FDR controlling procedure has 
been applied, for various nominal FDR values a. The achieved FDR shown and 
lies below the 45° line (broken), as it should. 

In many testing situations, especially in computational biology, the outcome 
is y is binary, for example case-control status. Then the Gaussian linear model 
is not appropriate. In that setting we can extend the polyhedral lemma and 
prototest to yield asymptotically valid selection-adjusted p-values. We give 
some details next. 

Consider a likelihood-based regression model with natural parameter r) = 
Xfi and log-likelihood l(f3). Let [/(/ 3) and X(/3) be the sample score vector and 
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information matrix, and Z7(/3) and X(/3) be the population versions. Assuming 
the existence of some true coefficient vector (3*, under standard asymptotics 
with p fixed and n —> oo, we have 

U(F)~N(0,l(pj). (7) 

In finite samples, we can use a normal approximation to the score. The usual 
score test enters the variable j\ with sign ,s-| that maximizes s ■ I7, (0)/X, (0) 1 '' 2 
over all variables j and signs s. Suppose that wish to test the hypothesis /3jj = 0, 
by testing that Uj 1 (0) = 0. To do this, we use the approximation 

17(0) ~JV(0,1(0)), (8) 


and apply the polyhedral selection and inference lemmas, with constraints of 
the form 

sic£ Uj{0) > ±cjUj (0) for all j ^ j x (9) 

Above, Cj contains Ij{ 0) _1//2 in the jth component and is 0 in all other compo¬ 
nents. The prototype selection in prototest would choose the feature with the 
largest value of the score test, and the above allows us to compute selection- 
adjusted p-values. 

Next we consider the microarray example from Subramanian et al. (20051 
and Efron & Tibshirani (2007): there are p — 10,100 genes and N = 50 arrays, 
relating to cell lines with normal or mutated states for the p53 factor, n\ = 17 
normal and n-i = 33 mutated.One We coded this as y = 0,1. Subramanian 


et al. (2005) curated several collections of genesets, representing organization 
of genes into functional categories. Here we consider the “C2” grouping into 
522 overlapping gene sets, representing cell pathways. As in |Efron &; Tibshi-| 
(2007), we restrict attention to the 395 sets having > 10 members. 


ram 


The 

approaches of Subramanian et al. (2005) and Efron & Tibshirani (2007) sum¬ 
marize the genesets with some overall score, and then use a permutation test 
to assess the overall significance. Here we apply prototest, choosing the most 
representative gene from each gene set, and then compute exact p-values to ac¬ 
count for the selection. Figure [Tl] shows the results of applying the BH rule at 
various FDR levels. Qualitatively the number of significant sets is in line with 
the analysis found in|Efron & Tibshirani (2007). Due to space constraints, we 
defer a more in-depth study of prototest for the logistic and other generalized 
linear models to a future paper. 


7 FDR control for the protolasso procedure 


In this section we adapt the knockoff framework of [Barber fe Candes| ( |2014| to 
the protolasso method, in order to obtain False Discovery Rate (FDR) control. 
FDR was first introduced in Benjamini & Hochberg (1995) in a multiple 


testing framework. The authors define the FDR: 


E 


V 


max{i?, 1} 
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Figure 11: p53 data: number of significant gene sets found at various FDR 
levels. 


where R is the number of the (multiple) null hypotheses rejected and V the 
number of those rejected incorrectly. The introduction of this controlling mea¬ 
sure - and their procedure guaranteed to control it below some predetermined 
proportion q - changed the manner in which practitioners considered multiple 
testing. 

The seminal paper has spawned much subsequent research. One paper of 
particular interest is that of Barber & Candes (2014), which introduces a vari¬ 
able selection technique guaranteed to control FDR. In a variable selection set¬ 
ting, one would associate R with the number of variables selected and V with 
the number of null variables selected by the procedure. Their procedure is called 
knockoff screening and proceeds in a sequence of steps: 


• Formation of the “knockoff” matrix X of the predictor matrix X. 
This matrix is constructed so that X T X = X T X and X T X = X T X — 
diag(s). So the knockoff matrix preserves all the correlations of the original 
matrix, both amongst its own variables and those of the original variables, 
but reduces the correlation between knockoffs and their original counter¬ 
parts. The authors describe how to construct X and propose how to select 
the 0 < Sj < 1. 

• Construction of statistics for each pair of original and knockoff 
variables. Each pair gets its own statistic Wj , j = 1, 2,... ,p. If W 7 is 
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large and positive, then there is evidence against the hypothesis that this 
variable is null. 


• Definition of data dependent threshold, T = min 

where W = {\Wj\ : j = l,...,p} \ {0}. 


t G W 


t) 


< q 


• Select features with Wj > T to control FDR at q. 

The shrewdness of the method is in the manner in which they construct the 
knockoffs and the statistics Wj. They present many examples of potential Wj 
constructions, but note that their FDR result holds as long as the Wj obey two 
properties: 


1. Sufficiency property: Wj should depend on the data only via the 
(column concatenated) Gram matrix [ X X ] T [ X X ] and feature- 
response inner products [ X X ] y. 

2. Antisymmetry property: detailed in the reference. 


One example of valid Wj, mentioned in the paper, and used by us later in 
the section, is to fit a lasso regression of y on [ X X ], storing the largest 
value of the regularization parameter A such that the variable enters the model, 


i.e. 


Zj = sup{A : /3y (A) ^ 0} 


with the knockoff equivalent Zj similarly defined. Setting Wj = Zj — Zj achieves 
the desired properties. 

Proof of their results follows from a super-martingale argument which flows 
quite elegantly, considering the obviously complicated distributional properties 
of the Wj and T. In fact, the crux of the result hinges on a succession of lemmas 
(numbers 1, 2 and 3 in the reference) that establish certain exchangeability 
results, making the rest of the analysis essentially independent of the underlying 
distribution of the Wj. 

In the remainder of this section, we describe a method modifying the knockoff 
procedure to operate at the prototype level, meshing it nicely with the initial 
clustering-prototyping step championed in this paper. We design the procedure 
to allow replication of Lemmas 1, 2 and 3 of Barber & Candes (2014), hence 


establishing FDR control for our procedure at the prototype level. The section 
concludes with a brief experiment demonstrating an application of our knockoff 
method to Protolasso. 


7.1 A Knockoff procedure for protolasso 

One wonders whether the prototyping-clustering and knockoff procedures can be 
seamlessly interleaved. Suppose we have column centered and standardized the 
predictor matrix A'. Recognizing that the construction of knockoff X ensures 
A t X = X T X, one realizes that any clustering based on the correlation metric 
detailed above produces the same clustering on both sets of columns. An initial 
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instinct is to cluster the columns of X (and, by extension, for X ) and then 
finding maximal marginal correlation prototypes for each separately, delivering 
prototypes P = {P\, Pz,... Pk} for the K clusters of the columns of X and 
similarly P = {Aj A , ■ ■ ■, Ac} for A'. Note that the separate computation of 
the two sets of prototypes does not guarantee that Pj = Pj , despite the initial 
clustering being the same. This is because the maximal marginal correlation 
column (with response y ) in a given cluster need not be the same in X as it is 
in X. The knockoff procedure alters correlations with the response. One would 
then venture to proceed by forming matrices Xp and Xp an d performing the 
knockoff procedure as described by Barber & Candes (20141 on the response 
y with augmented matrix [ Xp Xp J. Although we believe this procedure 
could control FDR, we found in experiments that it has very low power (ability 
to detect prototypes in clusters containing signal variables). We suspect that 
this has to do with the separate selection of prototypes in the original and 
knockoff matrices, especially in matrices where we encounter high correlation. 
In high correlations, prototypes are less likely to be the actual signal variables 
from a cluster (since every variable is a good surrogate for all the others and 
non-signal variables may present as having high correlation with the response). 
Also, we choose the most correlated knockoff variable from the knockoff cluster, 
which reduces the apparent strength of the original variable in the subsequent 
computation of Wj (even if this original variable comes from a signal cluster). 
We see fewer large positive W 7 and knockoff screening tends not to select any 
variables. In sum: prototypes have sight of y before subsequent analysis and 
selected prototypes tend to mismatch over original and prototype matrices: Pj ^ 
Pj. This reduces power. 

Our procedure addresses each of these shortcomings and orders steps in 
such a way to ensure the exchangeability lemmas of Barber & Candes (20141 
still hold. The procedure is illustrated in Figure |T2| and detailed as follows: 


Knockoff procedure for protolasso 


1. Cluster columns of X and select number of clusters K , producing clus¬ 
tering {Ci, C 2 ,... ,C A '}. 

2. Split the data by rows into two (roughly) equal parts: y = 

( A« 

and A = I x(2) 



3. Prototype clusters via maximal marginal correlations, as before, using 
only y W and X^ 1 ). This yields prototype set P = {A, A, • ■ •, Pk}- y ^ 
and XPl are now excluded from further analysis. 


4. Form knockoff matrix X ( ' 2> from A' i) as described in Barber fe Candes 


(2014|. It is essential that we form the knockoff matrix here using all 
columns of A^ 2 b This is to ensure that the exchangeability results hold 
for our procedure. Details are in Appendix B. 
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5. Reduce to matrices Xp and Xp ; by selecting the relevant (and same) 
columns in X^ and X^ respectively. 

6. Proceed with knockoff screening using y < - 2 ' 1 and 


X 


( 2 ) 


xi 2) 


We show in the Appendix B how this procedure replicates the exchangeability 
lemmas in Barber & Candes (2014). Hence our procedure exactly controls FDR 
in finite samples, as summarized in the following lemma: 


Lemma 7.1 For any q £ [0,1], let Wj 2 \ j £ P be the statistics testing the 
knockoff-original pairs in the knockoff procedure for protolasso, where P = 
{Pi, P 2 , ■ ■ ., Pk} is the set of prototypes selected in the first stage of the protolasso 
procedure. Furthermore, let 


T = min 


1 + #{j £ P : W\ 2) < -t} 

t£W: -- ~ < q 

#{j e P : Wf ] > t} V 1 


where W = {|wj 2) | : j £ P}\{0}. Finally, let Sp = {j £ P : > T}. Then 


#{j ■ Pp. = 0 and j £ Sp} 

E - Si -_-—— < q 

[ #{J :jGSp}V 1 J 

where the expectation is taken over the distribution of yP\ which generates the 
prototype set P, and that of y^ 2 \ which generates the statistics 

In the next section we perform and experiment to study the power of the 
procedure. 


7.2 Simulation experiment 

The predictor matrices X ^ and X^l for the simulation experiment of this sec¬ 
tion are generated like the block diagonal type matrices generated throughout 
the paper. Each matrix is generated once, with n = 200 rows and p = 100 
columns, comprising of 10 blocks of 10 predictors each, with the pairwise cor¬ 
relation p = 0.5 in each block. 

We consider three configurations for /?. Most of the elements are set to 
zero. Each configuration has five clusters with non-zero coefficients, but different 
numbers of non-zero coefficients within a cluster. The three configurations are: 

1. Coefficients at indices j = 1,11,21,31,41 are set to /3*. Here we have one 
signal variable per cluster. 

2. Coefficients at indices j = 1, 2,11,12, 21, 22, 31, 32,41,42 are set to /3*. So, 
two signal variables per cluster. 
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Figure 12: Schematic of our knockoff strategy with sample splitting: Columns 
of original matrix are clustered using all of X. Solid vertical black lines delimit 
clusters. There are four clusters in the figure. Notice how the clusters are 
translated exactly to the knockoff matrix. Original matrix is then split into 
X^ and X^ and response into y^ and y^ (horizontal dotted line in top 
part of the figure). Prototypes are found for the clusters using only X^ and 
y ^. These prototypes are highlighted by coloured vertical bars. Notice how they 
too are translated directly over to the knockoff matrix. Knockoff matrix X^ 
is obtained from the whole X^ before reducing to prototypes. Notice that a 
knockoff o/A« is not created. 

3. Coefficients at indices j = 1,2,3,11,12,13,21,22,23,31,32,33,41,42,43 
are set to /3*. Three signals per cluster. 

Different numbers of signal variables in clusters are tried, because considerations 
from Appendix B suggest that having multiple signals, but only one prototype 
from a cluster, could challenge FDR control unless we construct the knockoff 
matrix X^ correctly. Our results seem to suggest we do have FDR control. 
Signal amplitude is varied, setting /3* = 1, 2,..., 9. 

At each setting for the coefficient vector (there are 3x9 = 27), we generate 
Si = 50 replications of and use it in conjunction with to find prototypes 
for the K = 10 clusters. At each of these Si = 50 realizations of the prototype 
set P, we run S 2 = 100 replications of the knockoff procedure for protolasso, 
computing the false discovery proportion (FDP) amongst the prototypes and 
the power - the proportion of prototypes selected that occur in a cluster with at 
least one signal variable. These two quantities are averaged over the S 2 = 100 
replications to provide estimates of FDR and average power at a given prototype 
set. FDR is controlled at q = 0.2. Figure [Til] shows boxplots of the estimated 
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FDR and average power over the Si = 50 prototype set realizations, as function 
of /3». 








Figure 13: Boxplots of average FDP (left panels) and average power (right 
panels) over Si = 50 different prototype set realizations for the knockoff proce¬ 
dure for protolasso (means in blue). Plotted as function of signal amplitude 
/3*. Data is a block diagonal type matrix with n = 200 rows and p = 100 
columns: 10 blocks of 10 variables each, p = 0.5. Top row: Single signal per 
cluster. Middle row: Two signals per cluster. Bottom row: Three signals 
per cluster. Red dotted horizontal lines drawn at FDR control level q = 0.2. 

Notice that the blue lines (our overall of FDR) are always well below the 
red FDR control line at q = 0.2. This despite one or two replications in con¬ 
figurations coming in above the threshold. This is probably due to variation in 
estimating the FDR at a given prototype realization. Average power increases 
with signal amplitude, but does so more slowly the more signal variables we put 
into clusters. This is probably since our prototyping procedure screens more 
signals at the prototyping stage, with the second stage lasso procedure (where 
we compute the Wj) involving variables with proportionately small amounts of 
signal. It might be more difficult for the true signal’s Zj to come to the fore 
above their knockoff counterparts Zj, leading to fewer true signal detections. 
Overall then, we seem to have FDR control, with reasonable signal detection 
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power. 


8 Real datasets 


In this section we apply the protolasso procedure to two real datasets, demon¬ 
strating example applications of the procedure. The two datasets are the data 
used in Conlon et al. (2003) and some data in HIV mutations. Neither dataset 
is wide p > n. This is not a problem for our method. It applies equally to 
datasets with n > p. 


8.1 Conlon et. al. data 


Figure [l4] plots the initial analysis of the Conlon et al. (2003). The data, initially 
used to demonstrate motif regression, has n = 287 rows and p = 195 columns, 
again mean centered and standardized (and permuted to get the correlated ones 
close together). Each column j represents a candidate word or ‘motif’, with Xij 
measuring how well a motif j is represented in a DNA sequence i. The response 
yi is protein binding intensity in DNA sequence i. 

Again, we apply the gap statistic to estimate the number of clusters. This 
gives K = 2, which is not wholly surprising considering the nature of correlation 
heat map in Figure [14]- notice the high correlation amongst all variables and 
little obvious grouping structure. Error variance is estimated once again via the 
least squares estimate a = 4.8485. 

Figure [l5] shows the CIs computed from the protolasso (using both the gap 
statistic estimated K = 2 and then again with K = 6) and the ordinary lasso 
procedures. The figure demonstrates how correlation can hamper inference. 
Notice how wide all the intervals are for the ordinary lasso. This undoubtedly 
has to do with the numerical instability inherent in the inverses computed for 
the lasso post selection constraint matrices. For the protolasso procedure with 
K = 2, we seem to select uncorrelated prototypes and the entertained set allows 
us to compute relatively narrow CIs for all variables, all of which include zero. 
When we have more prototypes in our second stage lasso (K = 6), the high 
correlation seems to widen our intervals again, but not as greatly as for the 
ordinary lasso. 

This dataset illustrates a good feature of the protolasso procedure - it se¬ 
lects only a few variables, breaking correlation where it can, leading to narrower 
CIs, but still allowing CIs to be computed for unselected variables. Still, it is 
not entirely immune to correlation if we specify an incorrect number of clusters. 
It should be noted that this dataset exhibits rather extreme correlation amongst 
its variables. 


8.2 NRTI data 


Rhee et al. (2003) study six nucleoside reverse transcriptase inhibitors (NRTIs) 
that are used to treat HIV-1. The target of these drugs can become resistant 
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Figure 14: Left panel: Heatmap of pairwise correlations amongst (rearranged) 
columns of Conlon et al. (2003) data. Note the absence of any distinct clustering 
structure. Most variables are quite highly correlated with one another. Center 
panel: Dendrogram of hierarchical clustering with complete linkage based on one 
minus the correlation matrix. Right panel: Plot of dendrogram merge heights 
against the number of clusters formed at that merge. The panel, combine with 
the adjacent dendrogram seems to suggest that there are very few clusters in the 
data. 


through mutation, and Rhee et al. compare a collection of models for predicting 
these drugs log susceptibility, a measure of drug resistance based on the location 
of mutations. The design matrix X comprises of n = 1057 rows and p = 217 
columns. Columns were centered and scaled to have t^-norm n. Columns were 
permuted to ensure that highly correlated columns were close to each other. 
Figure [TH] shows some plots in an initial analysis and clustering of the data. 
There is a clear strong cluster, with two or three smaller, weaker ones, but many 
uncorrelated variables. We expect there to be many clusters in the clustering, 
with the vast majority of them being singleton clusters. Indeed, the gap statistic 
suggests that there are 161 clusters in the data. However, we set the number to 
K = 150. This ensures that the visible clusters are captured. 

Since n > p, we can estimate the variance using the least squares variance 
estimate. This gives a = 0.2252. Given this estimate, we are free to run the 
protolasso and ordinary lasso procedures, selecting prototypes in the second 
stage and variables in the original matrix (respectively) via 10-fold cross vali¬ 
dation. We then construct confidence intervals for the variables selected by the 
lasso and for the prototypes selected by the second stage lasso in protolasso 
and for the other members of these selected clusters. These confidence intervals 
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are shown in Figure 17 


We notice that the two methods select very similar entertained sets, con¬ 
structing very similar confidence intervals for those variables selected. Some 
interesting observations: 


• Both detect variable 135 and give it an interval lying quite far above 
everything else. This variable occurs in a singleton cluster, so protolasso 
has nothing else to say about it. 

• The lasso selects one variable from the large, strong group on the far 

left (number 17) and constructs an interval for it [0.0034, 0.03111]. 

Protolasso does not consider this cluster at all. 

• Variables 65 - 68 occur in the same cluster. The ordinary lasso selects all 
but number 65 and constructs intervals for them, the interval for number 
68 including zero. Protolasso, on the other hand, gets at the same set of 
variables by selecting only 67 as a prototype, but then constructing inter¬ 
vals for all of them, because they are in the entertained set. Protolasso 
says the 95% Cl for variable 68 is [0.0567, 0.0915]. 

• Something similar happens for variables 69, 72 and 73. Lasso only selects 
72 and 73, while protolasso selects 72 as prototype and then gets at all 
of them via the entertained set. 


9 Discussion 


We have introduced a coherent procedure for clustering, prototyping and subse¬ 
quent analysis of datasets with groups of correlated variables. The biggest selling 
point of our procedure is our use of the post-selection framework of [ Lee et al.| 
(2014) to obtain exact p-values in subsequent (i.e. post prototyping) regres¬ 
sion and/or testing. Furthermore, we show how the recently proposed knockoff 
procedure of Barber & Candes (2014) can be adapted to our protolasso pro¬ 
cedure, guaranteeing FDR control with reasonable power to detect variables in 
true signal clusters. Subsequent research will include further development and 
study of the prototest procedure, as marginal testing of correlated features is 
an important and challenging problem, for example in computational biology. 
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Appendix A: A gap statistic for the automatic 
selection of the number of clusters in hierarchi¬ 
cal clustering 


To estimate the number of clusters, we use a gap statistic procedure very similar 
to that of Tibshirani et al. (2001). We give details here. Figure 18 reveals the 
intuition behind our idea. Each row of the plot presents results for a different 
dataset. Details are discussed in the caption. 

For each dataset with correlated variables, A', we present two dendrograms. 
The center panel shows the hierarchical clustering dendrogram for the original 
A. In each case we see fairly clearly the range of the heights at which to cut the 
dendrograms so as to produce an optimal clustering - we find a place where the 
vertical distances between nodes and their immediate children are large. If we 
plot the height at which dendrogram merges occur, as a function of the number 
of clusters formed with that merge (green curves, bottom panels), we should she 
a sudden decrease as we merge to the correct number of clusters. We see this 
in both panels. An immediate idea is to take the first difference of this green 
curve (guaranteed to be negative for linkages without inversions) and find the 
location of the largest decrease. 

However, even in the absence of correlated groups, the dendrogram height 
curve decreases sub-linearly. This is demonstrated by generating the matrix U, 
of the same size as X with elements 


Uij ~ unif(min{Ay, X 2 j, ■ ■ .X nj },max{X lj ,X 2 j,. ■. X nj }) 

Notice that U is of the same scale as X, but that the correlation between 
columns is broken. Another way to achieve a similar aim is simply to permute 
the columns of A, each with a different randomly selected permutation. The 
dendrogram produced by applying the same hierarchical clustering algorithm to 
U is shown in the top panels and the dendrogram merge heights are shown in 
the bottom panels in blue. Note the monotone, sub-linear decrease in the blue 
curves. These curves give us an indication of what the decrease in the curve 
looks like when there is no correlation and is a good foil to the behavior of the 
dendrogram curve of X. It is the difference in the behavior of these curves which 
provides information on the number of clusters. In particular, we have in mind 
a “gap” statistic of the form: 


G k = K - h 


x 


where k is the number of clusters formed at the most recent merge and hf 
and hf. are the heights of the merge producing this number of clusters in X 
and U respectively (green and blue curves in the bottom panels of Figure [l8|). 
Following |Tib shirani et al. (2001), we instead look at the expected value of this 
quantity (over the uniform distribution, fixing A): 


A' 


g k = E[h%] - h k 
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The expected value can be approximated using a Monte Carlo simulation 
with B replications of U being generated: U^\ U^ 2 \ ... U I ' B ' 1 from the uniform 
distribution above, then computing 

9k = h k W - h k ( 10 ) 

6=1 

Note that we can estimate the standard error of the gap statistic as 


s'e.(gk) 





where (if = 

From the figure it seems as though the true number of clusters reveals itself 
at the point where the gap between the uniform and original dendrogram heights 
suddenly increases. This happens because the dendrogram heights plunge sud¬ 
denly for the correlated set, but does not do so for the uniform dendrogram. 
A good way to detect this phenomenon is to consider the first difference of the 
gap statistic: 

dk = 9k — 9 k -1 

and estimating the number of clusters as 

k = argrnax fc c4 (11) 

The difference curves dk (for a single replication B = 1) are plotted in purple 
in the right panels of Figure [lS| Note the spikes in these curves at the true 
number of curves. 


Recovery performance of the gap statistic 

As a basic proof of concept to convince the reader of the merit of our cluster 
number estimation procedure, consider a small simulation study exacted on the 
two datasets described in the caption of Figure [18] We considered the grid of 
pairwise correlation values p = 0.1,0.2,... 0.9. For each of the values of p , we 
generated S = 100 replications of the pair of described datasets and applied our 
gap statistic procedure to each. We recorded the number of estimated clusters 
and a measure of cluster quality for each run. The measure of cluster quality is 
the Adjusted Rand Index of [Ran d (197l]) - a number between 0 and 1 measuring 
how well the achieved clustering matches the true underlying group structure. 
A value of 1 occurs when the clustering matches the truth exactly; a value of 0 
occurs when no pair of columns that share a group in the true grouping share 
any cluster in the clustering. 

We notice that the method is rather successful at recovering the correct 
clustering (both the numbers and cluster membership), especially at higher 
correlations (p > 0.5). At lower correlations (p < 0.3), there seems to be a 
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tendency to overestimate the number of clusters (especially in the dataset in 
the bottom row, with its many smaller clusters - a more difficult clustering to 
detect). This overestimation is not dire and is preferred to underestimation of 
the number of clusters. Since we pick only one prototype from each cluster, we 
would rather overestimate the clusters and include more prototypes in the next 
step than the alternative where we increase the chance of screening potentially 
useful signal variables at this stage. 


Appendix B: Knockoffs with prototypes and the 
exchangeability lemmas of Barber &; Candes 


Barber & Can des| (|2014| prove three lemmas serving as key ingredients in the 
super-martingale argument proving FDR control. In fact, the first lemma is 
the important property, while the latter two merely serve to aid in the proof 
of the first. As far as we can tell, this is the only roadblock preventing the 


FDR (and other results based on the super-martingale argument) in Barber & 


Candes (2014) from going through for our knockoff/prototypes procedure. We 


address these three lemmas and argue that they hold for our procedure as well. 
Lemma 1 is stated (with mild notation modifications) as it is encountered in 
B arber fe Candes| ( 2014| ). 


( 2 ) 

We introduce a change of notation. We let Wf ’ to represent the Wj intro¬ 
duced in Section [7] This is merely to emphasize that these W-statistics stem 
from our knockoff/prototype procedure and depend (mostly) on X^ 2 / The first 
lemma states that the signs of the II-v associated with null columns are i.i.d: 


Lemma 9.1 Let e £ {±1} A be a sign sequence independent ofW^ ..., w\f), 

with efc = +1 for all non-null prototype A and ek {±1} for null prototypes. 

Then 


{w[ 2 \wf\ ...,w^) = (ei • wl 2 \ e 2 ■ Wf\ ...,e K - W f) 


Since we assume fixed X , the initial clustering, and hence value of K , are also 
fixed. However, there is an extra layer of randomness in W^ than in Barber 
fe Candes (20141. This randomness is injected by the prototyping step and 
is generated by randomness in y^ x \ which determines the set P. Fortunately, 
y^\ and hence also P, is independent of y ( 2 \ with this latter random variable 


generating the randomness in the knockoff procedure as understood from Barber 


& Candes (2014). We proceed to prove the exchangeability lemmas conditional 
on event {P = P}, establishing Lemma 9.1 above conditionally given {P = P}. 


Simply integrating over the distribution of y ^ establishes the unconditional 
result as stated. 


The second lemma pertains to the pairwise exchangeability 
note that conditional on {P = P}, if we consider the lemma for 

of features. We 

y(2) 

-A p p 5 

we see that it goes through exactly as stated in 

Barber & Candes 

to 

O 

i— 1 

4^ 

). Since 


37 






























we construct the knockoff X^ from all of X^ and since Xp' 1 and Xp'* are 
formed from their respective sources by selecting the same columns, we are 
assured that the necessary correlation structure is retained. 

Lemma 3 pertains to pairwise exchangeability for the response. Again, con¬ 
ditional on {P = P}, we note that y^ ~ N(X l ' 2 ' > /3, u 2 I) and proof of the lemma 
requires 

x l £ T XW 0 = x% r X^/3 ( 12 ) 

where Xp9 and Xp^ are the k th columns of Xp' 1 and Xp^ respectively and Pk 
is one of the null columns in the original X. Here we note that it is essential 
to form the knockoff matrix X^ before reducing to prototype matrices. We 
require that 

Pi x{ pJ x 0 = PfipT x i 

hp up 


Noting that there may be signal variables amongst the non-prototypes (i.e. 
Pj 0 for some j P), we require the necessary correlation guarantees from 
constructing X 1 ' 2 ' 1 from the entire X^ 2 \ Our procedure as stated computes the 
knockoff at the correct junctur e and the proof of the third lemma (conditionally 


on {P = P}) follows as in B arber fe Ca ndes (2014). So, conditionally on 
{P = P} we have that Lemma 9.1 holds and hence also holds unconditionally, 
as argued above. 

Notice that this FDR guarantee holds for the prototypes individually. We 
are guaranteed an upper limit on the number of false nulls selected that also 
happen to be prototypes. If we redefine our focus to “selecting” clusters rather 
than individuals prototypes, defining selection success as selecting a prototype 
from a cluster with at least one signal variable, we see that FDR control still 
holds. 
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Figure 15: Top panel: 95% post selection confidence intervals for variables 
in the entertained set of the protolasso procedure applied to the \Conlon et ai 
(2003) data. Selected prototypes have red intervals; other members have grey 
intervals. Number of clusters in initial clustering set to K = 2, while regular¬ 
ization parameter in second stage lasso on prototypes selected via 10-fold cross 
validation. Clustering structure represented by coloured crosses close to hori¬ 
zontal axis. Center panel: Same as top panel, but with K = 6. Bottom 
panel: 95% post selection confidence intervals for variables selected by the or¬ 
dinary lasso applied to the Conlon et al. 2003) data. Regularization parameter 
selected using 10-fold cross validation. 
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Figure 16: Left panel: Heatmap of pairwise correlations amongst (rearranged) 
columns of NRTI data. Note the one large, strong cluster, with two or three 
smaller, weaker clusters, with a smattering of pair clusters elsewhere. Center 
panel: Dendrogram of hierarchical clustering with complete linkage based on one 
minus the correlation matrix. Right panel: Plot of dendrogram merge heights 
against the number of clusters formed at that merge. The panel, combine with 
the adjacent dendrogram seems to suggest that there are many clusters in the 
data, most of which will be singleton clusters. 
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Figure 17: Top panel: 95% post selection confidence intervals for variables in 
the entertained set of the protolasso procedure applied to the NRTI data. Se¬ 
lected prototypes have red intervals; other members have grey intervals. Number 
of clusters in initial clustering set to K = 150, while regularization parameter in 
second stage lasso on prototypes selected via 10-fold cross validation. Clustering 
structure represented by coloured crosses close to horizontal axis. Singleton clus¬ 
ters get the same colour to reduce clutter. Bottom panel: 95% post selection 
confidence intervals for variables selected by the ordinary lasso applied to the 
NRTI data. Regularization parameter selected using 10-fold cross validation. 
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Figure 18: Dendrogram plots and plot of dendrogram cut heights against num¬ 
ber of clusters present at that height. Left column: X has n = 50 rows and 
p = 100 columns, divided into 10 clusters. Columns within clusters share pair¬ 
wise correlations of p = 0.5. Columns in different clusters are uncorrelated. 
Right column: X has n = 50 rows and p = 100 columns. One cluster of 
size 20 has pairwise correlation p = 0.5; )0 further pairs of correlated columns 
make up the rest. Top panel in each column: Hierarchical clustering den¬ 
drogram (complete linkage) for a matrix U, where each column has uniformly 
generated replicates in the range between the minimum and maximum values of 
the corresponding column of X. Centre panel of each column: Hierarchical 
clustering dendrogram (complete linkage) for the matrix X. Bottom panel of 
each column: Plot of dendrogram cut heights against the number of clusters 
present at that height. Purple curves have their scales on the right hand vertical 
axis. These represent the first different of the curve obtained from the differ¬ 
ence between the blue and green curves (the “gap”). Note the large spikes in 
the purple curves around the true number of clusters. Red horizontal lines are 
representative cuts that would produce the correct clustering. Vertical dashed 
grey lines show the true number of clusters. 
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Figure 19: Top row: Dataset as in left column of Figure 18 Left panel: 


Boxplots over S = 100 replications of the estimated cluster count using complete 
linkage and a gap statistic with B = 100 Monte Carlo simulations, as a function 
of the pairwise correlation amongst columns p. Horizontal red line shows the 
true number of clusters. Right panel: Boxplots over same S = 100 replications 
of the adjusted Rand Index of the clusterings so obtained, also as a function of 
p. Bottom row: Dataset the same as in the bottom row of Figure |16’[ Left 
panel and right panel: Estimated cluster counts and adjusted Rand Indices, 
as in the top row, using single linkage hierarchical clustering. 
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